Nature Machine Intelligence — Latest Matching Preprints

1

Towards A Foundation Model for Clinical Voice Biomarkers

Elemento, O.; Sigaras, A.; Colonel, J.; Hajirasouliha, I.; Ghosh, S.; Bensoussan, Y.; Bridge2AI-Voice Consortium, ; Rameau, A.

2026-05-30 health informatics 10.64898/2026.05.28.26354346 medRxiv

Top 0.1%

12.4%

Show abstract

Vocal biomarkers, encompassing voice and speech, have largely been developed for individual conditions in isolation, limiting their generalizability across diseases and recording settings. To address this, we introduce VoiceFM, a contrastive model that learns general-purpose clinical voice representations by aligning audio embeddings with rich clinical metadata. Using the Bridge2AI-Voice dataset (984 primarily English-speaking adult participants, 846 used for training and 138 held out as a temporally separated validation cohort, 40,056 recordings totaling 176 hours across 5 academic medical centers), VoiceFM pairs a fine-tuned Whisper large-v2 encoder with a tabular transformer over 44 clinical features via symmetric InfoNCE loss. Linear probes on frozen VoiceFM embeddings achieve mean AUROC 0.952 +/- 0.005 across five evaluation tasks (control vs disease screening plus four disease categories), significantly outperforming Frozen Whisper (0.926 +/- 0.013, p = 0.013), Frozen HuBERT (0.885 +/- 0.017, p = 0.0009), and the contrastively trained VoiceFM-HuBERT (0.938 +/- 0.006, p = 0.012). On the 138-participant held-out cohort, VoiceFM-Whisper achieves AUROCs of 0.99 for Alzheimer's/dementia/MCI and 0.89 for airway stenosis, demonstrating that the learned representations generalize to participants the model has never seen. VoiceFM representations transfer to three external datasets without retraining and improve few-shot classification. Recording task attribution identifies a small set of speech tasks that match or exceed the full battery's performance, suggesting shorter screening protocols are feasible. Trained predominantly on English audio, VoiceFM transfers without fine-tuning to Spanish-language Parkinson's disease (PD) detection (NeuroVoz, 107 participants, AUROC 0.93 +/- 0.02), with the signal dominated by articulatory rather than phonatory features. A fine-tuned classifier achieves participant-level AUROC 0.87 (sustained 0.85, countdown 0.80) on the mPower smartphone study (585 held-out participants). Together, these results show that contrastive alignment between voice and rich clinical metadata can serve as the basis for a clinical voice foundation model, producing a single set of transferable representations that generalize across diseases, languages, recording conditions, and patients enrolled after model freeze.

2

Personalized clinical reference intervals for routine precision medical care

Zhang, C.; Chen, Y.-L.; Jamilov, A.; Liu, E.; Shree, S.; Lam, B. D.; Foy, B. H.

2026-05-30 health informatics 10.64898/2026.05.28.26354363 medRxiv

Top 0.4%

6.3%

Show abstract

Most routine clinical markers are interpreted using population-based reference intervals, despite being regulated around patient-specific homeostatic setpoints. This mismatch obscures physiologic shifts, inhibiting detection of early disease signatures. Here, we develop a novel Bayesian inference method that adaptively constructs personalized reference intervals using each patients existing health records. In analysis of >100 million lab tests in >800,000 patients, these personalized intervals can be accurately constructed with only minimal prior data, meaning this method can be applied near universally. We show that across 43 common lab markers, patient setpoints are strongly associated with future morbidity, with signal strength increasing as more test data is collected. Deviation from personalized reference intervals provides strong and novel risk signatures across diverse disease states, including hypothyroidism, hematologic cancers, kidney disease, and pregnancy complications. Importantly, personalized reference intervals capture a different risk signature to existing population-based approaches, with the highest risk patients being those who deviate from both intervals simultaneously. In a targeted clinical use case study of iron infusion, use of personalized reference intervals greatly improved prediction of treatment efficacy and allowed precise tracking of treatment responses. Our results illustrate how existing health records can be used to construct personalized benchmarks for nearly all common clinical tests, driving a new paradigm for precision laboratory medicine.

3

DISCERN: A Clinical Impact-aware Framework for Radiology Report Comparison

Sharma, R.; Beeche, C.; Dong, J.; Zhuang, R.; Qu, H.; Zhang, R.; Gangaram, V.; Goswami, P.; Xin, J.; Ballard, J.; Goldberg, A.; Sagreiya, H.; Long, Q.; Chen, T.; Witschey, W. R.

2026-05-27 radiology and imaging 10.64898/2026.05.26.26353612 medRxiv

Top 0.7%

4.4%

Show abstract

The surge in medical imaging has spurred the development of vision-language models (VLMs) to alleviate radiologist workloads. However, clinical deployment is hindered by the lack of meaningful evaluation frameworks. Current metrics - ranging from semantic similarity to large language model (LLM) based judges - often fail to distinguish between clinically trivial and critical discrepancies, poorly reflecting real-world clinical judgment. To address this, we introduce DISCERN (Discordance and Significance-aware Entity-level Radiology Report Comparison). DISCERN is a significance-aware framework that weighs report errors based on their potential impact on patient care. Our results demonstrate that DISCERN powered by closed source LLMs aligns more closely with expert radiologist assessments than traditional metrics or current LLM evaluators, providing a more interpretable and clinically relevant benchmark. By modeling radiologist prioritization and entity-level feedback, DISCERN facilitates targeted model refinement and ensures the safer integration of generative AI into clinical workflows.

4

TopBrain Segmentation Challenge for Whole Brain Vessel Anatomy

Yang, K.; Shi, P.; Huang, H.; Musio, F.; Baazaoui, H.; Aydin, O. U.; Hilbert, A.; Hamadache, R. E.; Yalcin, C.; Zhang, M.; Falcetta, D.; de la Rosa, E.; Shit, S.; Prabhakar, C.; Wittmann, B.; Rokuss, M. R.; Kirchhoff, Y.; Al-Maskari, R.; Hoeher, L.; Juchler, N.; Casamitjana, A.; Cleary, J.; Schmick, A.; Baumgartner, P.; Deseoe, J.; Vandans, O.; Lee, D.; Oh, K.; LaBella, D.; Mazher, M.; Niederer, S. A.; Qayyum, A.; Liu, Y.; Chen, J.; Kim, W.; Asawalertsak, N.; Kim, M.; Shin, D.; Park, S.-H.; Kikuchi, S.; Zhang, Y.; Liu, J.; Cui, Y.; Qiu, Y.; Verschuur, A.; Zhang, J.; van der Schaaf, I.; Su, R.;

2026-05-30 radiology and imaging 10.64898/2026.05.28.26354312 medRxiv

Top 0.7%

4.3%

Show abstract

We present the TopBrain 2025 Challenge, the first benchmark for fine-grained multiclass segmentation of the whole brain vasculature in both computed tomography angiography (CTA) and magnetic resonance angiography (MRA). Building on the TopCoW challenge, TopBrain scales vessel annotation from the Circle of Willis to the entire brain, introducing a dataset of 90 annotated volumes across 48 landmark vessel classes spanning arterial and venous systems, of which 50 training volumes are publicly released. Vessel definitions were consolidated from established neuroanatomical references into a unified annotation scheme, and vessel caliber measurements along the centerline are reported for the first time across the whole brain vascular anatomy. To address the unique challenges of multiclass brain vessel segmentation, we propose an evaluation framework that accounts for detection in segmentation performance, assesses anatomical plausibility, and introduces novel contamination metrics that characterize inter-class prediction errors. Fifteen teams from over 220 registered participants submitted algorithms to the benchmark. The top-performing teams built on nnUNet with principled system design choices, achieving around 80% Dice scores, near-zero invalid neighbor counts, over 60% F1 scores for side-road vessels, and below 18% foreground contamination ratio. Larger vessels are easier to segment, while smaller and more complex vessels remain the true bottleneck. The annotated datasets and podium-finish algorithms are made publicly available on Zenodo.

5

High Resolution Multi-depth Quantification of the Retinal Nerve Fiber Layer

Callet, C.; Bertrand, M.; Guzman, K.; Mece, P.; Rossi, E. A.; Grieve, K.

2026-06-01 ophthalmology 10.64898/2026.05.22.26353127 medRxiv

Top 0.9%

3.7%

Show abstract

The retinal nerve fiber layer, composed of axon bundles converging toward the optic nerve, is a key biomarker for diagnosing and monitoring glaucoma and other neurodegenerative diseases. High-resolution en face imaging of individual nerve fiber bundles offers morphological information beyond what conventional optical coherence tomography provides, yet clinical integration remains limited by the lack of automated analysis tools and normative data. Here, we imaged 14 healthy volunteers using time-domain full-field optical coherence tomography and adaptive optics scanning laser ophthalmoscopy, and developed automated pipelines to quantify bundle width, trajectory, tortuosity, and orientation. Bundles were on average 25% wider at shallower retinal depths, width measurements were consistent across imaging modalities, and estimated axon count per bundle decreased significantly with age. Global trajectory analysis revealed systematic deviations of high resolution data from existing mathematical models, particularly in the temporal sector, leading us to propose two refined trajectory models. These normative results provide a foundation for high resolution biomarkers for use in investigations of retinal neurodegeneration.

6

Subtype Dynamics Reveal Horizon-Dependent Structure in Influenza Predictability

Mao, Y.; Lopman, B.; Koelle, K.; Lau, M. S.

2026-05-30 epidemiology 10.64898/2026.05.28.26354347 medRxiv

Top 1%

3.0%

Show abstract

Accurate forecasting of seasonal influenza is critical for public health preparedness, and data-driven models are central to this effort. However, most approaches rely on aggregate indicators of influenza-like-illness (ILI), which can obscure heterogeneity and limit predictability at longer horizons. While subtype dynamics are well established, their role in data-driven forecasting remains incompletely understood. Here, we integrate subtype-resolved surveillance data into diverse data-driven frameworks using over a decade of U.S. surveillance records to evaluate and decompose predictive signal in influenza forecasting. Across pre- and post-COVID-19 periods, subtype-informed models consistently improve over baseline models trained on aggregate ILI alone, with the largest gains at longer horizons. Decomposition reveals a horizon-dependent reorganization of predictability: autoregressive persistence in recent aggregate incidence dominates at short horizons but declines with lead time, while predictive signal shifts toward subtype-derived structure. Within this structure, interaction-related features among co-circulating subtypes grow systematically with forecast horizon, indicating that longer-term predictability is driven increasingly by interaction structure rather than marginal subtype composition alone. Together, our results show that subtype information provides non-redundant predictive signal and extends the effective forecasting window of data-driven models. More broadly, our findings suggest that aggregation of heterogeneous subtype processes can obscure latent predictability, supporting subtype-resolved surveillance.

7

Deep learning optimisation for cardiology: Neural Architecture Search-driven arrhythmia classification with electrocardiograms

Vanegas Mueller, E.; Joe-Oshodi, A.; Banerjee, A.; Villarroel, M.

2026-05-30 cardiovascular medicine 10.64898/2026.05.28.26354348 medRxiv

Top 1%

2.6%

Show abstract

Cardiovascular disease is the leading cause of death worldwide. Sudden cardiac death (SCD) accounts for roughly 50% of all cardiac deaths. The electrocardiogram (ECG) is widely used for early diagnosis of cardiac disease. However, the complexity of accurate interpretation limits the ECG's efficacy. Modern deep learning methods have been applied to assist clinicians in diagnosis. We applied Neural Architecture Search (NAS), an automated machine learning technique, to identify optimal deep learning architectures for classifying cardiac arrhythmias from ECGs. We applied the Differentiable Architecture Search strategy to an AutoFormer search space to identify optimal self-attention architectures for arrhythmia classification. We trained, validated, and tested the resulting model on the PhysioNet Challenge 2021 dataset (n = 88,253), comprising ECGs across three continents. We performed a hyperparameter optimisation on the NAS output, exploring input patch size, class weighting, and loss function. We evaluated performance using the PhysioNet Challenge metric and the area under the receiver operating characteristic curve (AUROC). The NAS converged towards minimal architectural configurations (embedding dimension: 384, depth: 4, self-attention heads: 4, MLP ratio: 1) with a validation challenge metric of 0.66 (PhysioNet Challenge 21 Winner: 0.63). The NAS-created network achieved an AUROC of 0.97 and a challenge metric of 0.71 during testing. Normal Sinus Rhythm and Sinus Tachycardia achieved AUROCs of 0.99. Low-QRS Voltage and T-wave abnormality were the worst-performing arrhythmias, with AUROCs of 0.89 and 0.90, respectively. We interpret that architectural simplicity drives performance in arrhythmia classification. Because SCD is unexpected, prevention strategies in free-living environments require lightweight computational resources suitable for wearable devices. Class imbalance fundamentally limits classification performance for rare arrhythmias such as Low-QRS Voltage and T-wave inversion, irrespective of hyperparameter choices. However, the self-attention mechanism can autonomously abstract clinical representations, simplifying clinical deployment by eliminating the need for an explicit feature-extraction pipeline.

8

Personalized Brain-Based Analgesia Detection with Portable fNIRS and AI

Minoccheri, C.; Joo, P.; Hu, X.-S.; Affendi, H.; Elayyan, F.; Harville, A.; McDonald, N. J.; Botero, T.; DaSilva, A. F.

2026-05-28 dentistry and oral medicine 10.64898/2026.05.20.26353377 medRxiv

Top 2%

2.0%

Show abstract

Neuroimaging based pain decoding faces two underappreciated challenges: between subject variability that prevents classifiers from generalizing across patients, and within session cross validation designs that inflate reported accuracy by conflating within person and between person variance. Here we address both using portable functional near infrared spectroscopy (fNIRS) during pharmacologically verified local nerve anesthesia. Twentyfive patients with clinically painful teeth underwent 36 channel bilateral fNIRS during percussion before ("Pre") and after ("Post") local nerve anesthesia. In 13 block-success patients, a paired Pre versus Post comparison with healthy tooth control identified three temporal hemodynamic response function (HRF) features (late slope, mean first derivative, and baseline normalized amplitude) whose analgesia interaction effects (d = 0.63 to 0.79) exceeded that of raw general linear model (GLM) amplitude (d = 0.56), with a significant difference-in-differences interaction (p = 0.011). Per-patient calibration with these features yielded leave one subject out (LOSO) AUC = 0.68 to 0.76 for nonlinear classifiers (permutation p = 0.002), with HbO-specific feature selection achieving the best performance (RF AUC = 0.760); a healthy tooth negative control was non-significant. End to end deep learning on raw time series (CNN LSTM AUC = 0.719) was competitive with feature based classifiers, while linear models did not reach significance. Critically, head to head comparison of within-session CV and LOSO on the same data revealed mean inflation of +0.13 AUC across all model types, including deep learning, demonstrating that high within session accuracy alone does not establish subject-independent validity. Exploratory analyses suggested complementary roles for oxyhemoglobin (HbO; within patient analgesia detection) and deoxyhemoglobin (HbR; cross patient information), and that trial to trial response variability may complement amplitude for cross patient pain detection. These results show that per patient calibration with temporal HRF features supports subject independent analgesic-state detection under strict LOSO evaluation, and that within-session validation (standard in the fNIRS pain- decoding literature) can substantially overestimate performance.

9

Normative Speech Modeling for ALS Diagnosis with Application to Other Neurodegenerative Diseases

Shah, M.

2026-05-27 neurology 10.64898/2026.05.25.26354057 medRxiv

Top 2%

1.9%

Show abstract

Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease affecting more than 450,000 individuals worldwide and is frequently diagnosed more than 12 months after symptom onset, delaying intervention during a critical early window. Because up to 80% of patients develop dysarthria within two years, subtle changes in speech provide a signal of early bulbar motor neuron degeneration. However, existing speech-based systems rely on supervised classification trained on limited datasets, achieving moderate sensitivity and depending heavily on labeled disease examples, which restrict scalability and early detection. This study introduces SPEAK-NORM, the first-ever normative speech modeling framework for early ALS diagnosis, which learns age- and sex-conditioned motor-speech distributions exclusively from healthy individuals. A conditional variational autoencoder models coordination of hypoglossal, laryngeal, and respiratory motor pathways, and deviation from this healthy manifold is quantified through latent representations and reconstruction error to form a 354-dimensional profile. A calibrated linear Support Vector Machine performs subject-level classification under subject-disjoint validation. On the VOC-ALS database (n = 153), SPEAK-NORM achieves 98% accuracy with balanced sensitivity and specificity, significantly outperforming established clinical acoustic indices and prior systems. The framework maintains strong performance under cross-task generalization and when retrained on healthy controls in independent dementia and Parkinson disease cohorts, demonstrating disease-specific deviation patterns rather than generic neurodegenerative change. Spectral, temporal, and latent separations further support interpretability. By modeling healthy speech instead of memorizing disease examples, SPEAK-NORM enables scalable early neuromotor screening using recording devices, with potential to support earlier diagnosis, differential classification, and monitoring of ALS progression.

10

Tracking the Dynamic Trajectories: A Global-to-Local Pharmacovigilance Analysis of GLP-1 Receptor Agonists

Lu, S.; Ruan, X.; Wang, L.; Wang, X.; Sameer, M.; Liu, H.

2026-06-01 health informatics 10.64898/2026.05.28.26354401 medRxiv

Top 2%

1.8%

Show abstract

Although GLP1/GIP receptor agonists demonstrate unprecedented weight loss efficacy, their rapid clinical adoption has revealed significant real-world tolerability challenges. To evaluate their dynamic safety profiles, we developed a macro to micro pharmacovigilance framework by combining global FAERS reports with local UT Physician EHR. Macroscopically, we distilled 17 shared adverse events across the drug class from FAERS with disproportionality analysis. Microscopically, local EHR data (289,655 longitudinal treatment sessions across 71,316 patients) revealed 51.6% of GLP1 sessions terminated within 90 days. Furthermore, temporal stratified logistic regression demonstrated that initial exposure (0 to 30 days) correlated strongly with nausea and vomiting, which attenuated in extended sessions, whereas extended exposure (>2 years) uncovered late onset risks, notably incident hepatic steatosis. Ultimately, this time aware framework reveals that GLP1 safety profiles are profoundly duration dependent, providing critical insights into both acute intolerances and long-term medication safety.

11

Automated quantification of cerebral microbleeds for ARIA-H monitoring in Aging and Alzheimer's Disease: A multicenter deep learning validation

Low, Z. X. B.; Rowsthorn, E.; Nazem-Zadeh, M.-R.; Francis, M.; Robb, C.; Howcroft, M.; Whiriskey, R.; Brodtmann, A.; McNeil, J. J.; Law, M.

2026-05-26 radiology and imaging 10.64898/2026.05.19.26353364 medRxiv

Top 2%

1.7%

Show abstract

We trained a self-configuring nnU-Net model for CMB segmentation in a heterogeneous multicenter sample (n=264), including 1.5T and 3T field strengths, SWI and T2*-GRE sequences, and community and clinical cohorts. Model performance was evaluated using 5-fold cross-validation with a focus on object-level detection metrics. Real-world performance was evaluated on scans from an unseen dataset of people with cerebrovascular disease (n=20). The model achieved 0.82 cluster Dice, 0.88 precision, and 0.77 sensitivity on hold-out test data. Notably, the model demonstrated a low false-positive rate, averaging 0.58 false positives (FPs) per scan, an improvement on existing publicly available models. The model achieved high performance in dataset of those with Alzheimer's disease and mild cognitive impairment (0.89 cluster Dice, 0.94 sensitivity), supporting its utility in clinical settings where ARIA-H monitoring is critical. In external validation, the model maintained high robustness with 0.79 sensitivity and 0.95 FPs per scan. By leveraging a heterogenous training strategy and a self-adapting architecture, we demonstrate that deep learning can achieve high-precision CMB detection that is robust to domain shifts. The low FP rate suggests this publicly available pipeline is suitable for automated screening and lesion counting in heterogenous large-scale clinical trials, reducing the burden of manual quantification.

12

Closed-Loop Quality Assurance for Production Clinical AI Documentation

Napier, A.; Wiley, J.; Heslin, M.

2026-05-29 health informatics 10.64898/2026.05.27.26353977 medRxiv

Top 2%

1.6%

Show abstract

A closed-loop quality system deployed across thirteen US hospital sites resolved physician complaints with zero regressions on 42 tracked cases across 1,089 optimization iterations, while a deterministic assembly-agent replacement cut H+P trace latency from 19.6 s to 10.8 s (-8.8 s, 95% CI [-10.5, -7.1] s; n = 100 pre, n = 100 post). We report four observations and an architectural follow-through. First, the same binary-check instrument produces opposite outcomes depending on the question asked: "maximize this score" produces structurally-correct notes that physicians reject (Spearman rho = -0.077, 95% CI [-0.40, 0.26], n = 36); "did this specific fabrication stop?" produces rater-invariant deployment decisions. Second, in our pipeline, assembly-stage agents did not respond to prompt optimization the way reasoning agents did: four consecutive optimization attempts produced 18-28 point regressions. Third, physician preference is rater-fragile at typical clinical-AI calibration sample sizes (Cohen's kappa = 0.028 between two board-certified physicians, 95% CI [-0.30, 0.36] on n = 35 overlapping pairs). Fourth, the architectural punchline: six weeks after the prediction, the LLM call at the chart-assembly step was replaced with a deterministic renderer (sub-500-character template plus sandboxed scripting), lifting the defect-free rate on a 51-case holdout from 49% to 84%. We introduce a Pareto-with-absolute-floors acceptance rule (multi-axis commit with severity-class categorical vetoes) as a methodological contribution distinct from scalar-reward acceptance in standard prompt-optimization frameworks. Cross-iteration rejection memory prevents the loop from re-proposing edits already rejected three or more times. A reproducibility bundle (anonymized ablation per-case counts, bootstrap-CI data, analysis scripts) is released under CC BY 4.0 at github.com/sayvant/SQS-Auditor-paper-data.

13

Deriving OCT-Equivalent Retinal Nerve Fiber Layer Thickness Maps from Fundus Photographs with Deep Learning Improves Glaucoma Diagnosis

Shi, L.; Shi, M.; Chung, I. Y.; Pasquale, L. R.; Shen, L. Q.; Wang, M.

2026-05-27 ophthalmology 10.64898/2026.05.26.26354047 medRxiv

Top 2%

1.4%

Show abstract

Purpose: To develop and evaluate a deep learning model that predicts optical coherence tomography (OCT)-equivalent retinal nerve fiber layer thickness (RNFLT) maps directly from color fundus photographs and to assess their diagnostic value for glaucoma detection. Design: Retrospective model development and evaluation study. Participants: 15,031 paired fundus photographs and spectral-domain OCT scans collected at Massachusetts Eye and Ear between 2011 and 2022. Methods: Paired fundus and OCT images were used to train a U-Net-based model to predict pixel-wise RNFLT maps with artifact-corrected supervision. Diagnostic performance was evaluated across single-modality models (fundus photos only, real RNFLT maps, predicted RNFLT maps) and multimodal fusion models (fundus + predicted RNFLT maps). Stratified analyses examined model performance across glaucoma severity and demographic subgroups. Glaucoma was defined based on standard criteria applied to Humphrey 24-2 visual field testing. Main Outcome Measures: Mean absolute error (MAE) and structural similarity index (SSIM) for RNFLT map prediction. Area under the ROC curve (AUC) and accuracy for glaucoma detection. Results: RNFLT map prediction achieved a MAE = 15.4 m and a SSIM = 0.65, measured against artifact-corrected RNFLT maps derived from OCT. For glaucoma detection, the predicted RNFLT-only classifier outperformed the fundus-only classifier (AUC 0.889 vs 0.883, p < 0.005; Accuracy 82.0% vs 78.0%), but performed worse than the real-RNFLT-only classifier (AUC 0.889 vs 0.903, p < 0.005). Multimodal fusion of fundus images with predicted RNFLT maps improved performance, achieving an AUC of 0.909, outperforming all single-modality inputs (p < 0.005 vs fundus-only, predicted-RNFLT-only, and real-RNFLT-only). Performance gains between the fundus-only and the multimodal classifier were greater in early-stage glaucoma compared to severe cases: accuracy increased from 55.3% to 64.0% in mild cases, from 71.5% to 80.4% in moderate cases, and from 90.0% to 94.6% in severe cases. Conclusions: Predicted RNFLT maps derived from fundus photographs provide quantitative, OCT-like structural information and improve glaucoma detection. Unlike prior work that predicted only summary RNFLT values, our model generates full RNFLT maps that better support glaucoma classification than fundus images alone. This approach offers a scalable pathway for early glaucoma screening and expands diagnostic access in resource-limited settings.

14

VOGeo-Gaze: Calibration-Free, Geometry-Aware Deep Learning for Real-Time Gaze Tracking in Clinical Video-Oculography

Zhao, J.; Ahmadi, S.-A.; Decker, J.; Zwergal, A.; Eulenburg, P. z.; Flanagin, V. L.; Wuehr, M.

2026-05-29 health informatics 10.64898/2026.05.27.26354254 medRxiv

Top 2%

1.3%

Show abstract

Quantitative eye movement analysis is important for neuro- logical diagnostics, yet existing video-oculography (VOG) systems typ- ically require calibration, device-specific settings, or accurate gaze la- bels. We present VOGeo-Gaze, a real-time, calibration-free, geometry- aware neural network that estimates gaze by reconstructing anatomi- cally meaningful eyeball parameters from image features. The method combines segmentation-driven projection geometry, a refraction-aware pupil correction module, and temporal anatomical stabilization, so gaze is derived from interpretable eye geometry rather than direct angular regression. Trained only on the public TEyeD dataset with weak gaze supervision, VOGeo-Gaze was evaluated on 116 clinical recordings from 17 patients and 19 healthy subjects using EyeSeeCam, a clinical gold- standard VOG system. It achieved median absolute angular errors of 0.33{whitebullet} horizontally and 0.35{whitebullet} vertically, with nearly 92% of recordings below 1{whitebullet} error while operating at >300 FPS. These results demonstrate sub-degree clinical gaze estimation without subject-specific calibration, camera intrinsics, or accurate gaze labels, providing a scalable and inter- pretable alternative to conventional VOG pipelines. Code is available at https://github.com/DSGZ-MotionLab/VOGeo-Gaze.

15

Bridging Acoustic and Semantic Spaces for Interpretable Voice Scoring via Zero-Shot Semantic Expansion

Hsiao, C.; Cheng, Y.-R.; Yang, C.-Y.; Hsu, F.-S.

2026-06-01 health informatics 10.64898/2026.05.29.26354442 medRxiv

Top 3%

1.2%

Show abstract

Subjective auditory-perceptual evaluation and uninterpretable deep learning models limit the clinical assessment of voice disorders. This study proposes a two-phase zero-shot framework to evaluate voice pathology. First, an Audio Spectrogram Transformer is fine-tuned on the Perceptual Voice Quality Database to generate an acoustic latent space. Second, Orthogonal Procrustes analysis maps these acoustic embeddings directly onto the semantic space of a pre-trained Sentence Transformer. The geometric alignment produced continuous semantic axes that outperformed a supervised machine learning baseline in regressing clinician-rated GRBAS (Grade, Roughness, Breathiness, Asthenia, and Strain) severity scales. Furthermore, these axes correlate with traditional acoustic measures, including Harmonics-to-Noise Ratio and local jitter, while remaining robust when applied to aperiodic signals by not requiring fundamental frequency extraction. Most importantly, the model achieved zero-shot semantic expansion, successfully evaluating voices using an untrained, natural clinical vocabulary beyond the GRBAS scale. External validation on the Voice ICarus Database confirmed cross-corpus stability and demonstrated the capacity for zero-shot differential phenotyping of specific etiologies, such as hypokinetic dysphonia and reflux laryngitis. By bridging acoustic and semantic latent spaces, this framework offers an objective, continuous, and transparent metric for evaluating voice quality using voice descriptive vocabulary.

16

Genome-wide discovery reveals 30 loci for choroidal thickness and uncovers potential causal links with angle-closure glaucoma

Lee, S. S.-Y.; Wang, C. A.; de Vries, V. A.; van Hemert, D. J.; Schulze, A.; Brandl, C.; Aman, A. M.; Alonso-Caneiro, D.; Choquet, H.; Gorski, M.; Hammond, C. J.; Heid, I. M.; Hunter, M. L.; Hysi, P.; Jiang, C.; Jonas, J.; Klaver, C. C.; Kneepkens, S.; Konig, S.; Lingham, G.; Luber, C.; Melton, P. E.; Pennell, C. E.; Ramdas, W. D.; Read, S. A.; Schuster, A. K.; Wang, Y. X.; Zimmermann, M. E.; International Glaucoma Genetics Consortium, ; Khawaja, A. P.; Gharahkhani, P.; MacGregor, S.; Guggenheim, J. A.; Mackey, D. A.

2026-05-27 ophthalmology 10.64898/2026.05.26.26354075 medRxiv

Top 3%

0.8%

Show abstract

The choroid is critical for maintaining vision and implicated in several ocular diseases, being the sole source of nutrients and waste removal for the outer retina. Genetic discovery can help elucidate the pathways through which choroidal features influence disease risk. Our meta-analysis of genome-wide association studies (n= 78,682 participants) identified 30 genomic regions, including 20 novel loci, associated with choroidal thickness. Findings suggest inflammatory and vascular processes drive choroidal thickness, with overlapping mechanisms shared with refractive error. Genome-wide independently significant SNPs accounted for 18.7% of the genetic variance in choroidal thickness. Mendelian randomisation analyses showed a causal effect of age-related macular degeneration on choroidal thickness, and suggest a bidirectional causal effect between choroidal thickness and primary angle-closure glaucoma. These findings provide insight into the shared genetic architecture and biological pathways linking choroidal thickness and related diseases.

17

An ECG foundation model for generalizable cardiac function prediction across the lifespan

Yang, Y.; Peracchio, L.; Mayourian, J.; Miller, T.; La Cava, W.

2026-05-27 health informatics 10.64898/2026.05.26.26354128 medRxiv

Top 4%

0.7%

Show abstract

Background Artificial intelligence-enhanced electrocardiography (AI-ECG) enables scalable, low-cost cardiac dysfunction screening, but existing models are annotation-intensive and predominantly adult-derived, leaving paediatric generalizability uncertain. Paediatric cohorts exhibit highly variable cardiac morphology and function compared to adults, which may be useful for learning generalizable AI-ECG models. Methods We pretrained ECG-Fyler on a predominantly paediatric, all-age cohort at Boston Children's Hospital (1992-2023), annotated with a cardiology-specific coding system (Fyler codes), and evaluated it on assessments from echocardiography (echo) and cardiac magnetic resonance (CMR) studies. We validated on an external adult cohort from Columbia University Irving Medical Center. Performance was benchmarked against several AI-ECG foundation models by AUROC across age groups, lesion types, and limited-data scenarios. Findings The pretraining cohort comprised 782,138 ECGs from 255,271 patients (median age: 10.9 years, IQR: [2.8-16.8]). Internal evaluation included 178,495 ECG-echo pairs (median age: 10.9 [3.7-17.0]) and 8,584 ECG-CMR pairs (median age: 20.7 [15.6-29.6]). External validation included 82,543 ECG-echo pairs from adults (median age: 64.0 [52.0-74.0]). ECG-Fyler improved AUROC across biventricular dysfunction and dilation tasks, with the largest gains in low-data settings. In internal validation, ECG-Fyler detected low left ventricular ejection fraction (LVEF [≤] 40%) from only 100 fine-tuning samples (AUROC: 0.80, 95% CI: [0.78-0.80]), outperforming other models (AUROC < 0.65) and improving with additional fine-tuning (AUROC: 0.94 [0.93-0.94]). Similar improvements were observed for CMR-derived LVEF, RVEF, and ventricular dilation. In external validation on adults, ECG-Fyler exhibited an AUROC of 0.83 (CI: [0.82-0.85]) for LVEF [≤] 40%. After fine-tuning on less than 10% of external data, LVEF [≤] 45% performance (AUROC: 0.87 [0.86-0.88]) outperformed a fully trained, site-specific prior model (AUROC: 0.85 [0.84-0.87]). Interpretation Pretraining on richly annotated, paediatric-dominant ECGs yields models that transfer efficiently across institutions and ages, supporting AI-ECG screening and triage when labels or imaging access are limited. Funding National Institutes of Health (R01LM012973); Kostin Innovation Fund, Boston Children's Hospital

18

Pre-pandemic blood profiles predict COVID-19 hospitalization and death a decade later

Jacobs, L. A.

2026-05-29 epidemiology 10.64898/2026.05.27.26354230 medRxiv

Top 4%

0.7%

Show abstract

COVID-19 risk scores developed during the pandemic relied on measurements contemporaneous with infection, leaving unresolved whether the metabolic and inflammatory vulnerability they capture pre-existed as a stable trait or was triggered by acute illness. Here, using 501,946 UK Biobank participants whose blood was drawn between 2006 and 2010---at least ten years before SARS-CoV-2 emerged---we show that baseline proteomic and metabolic profiles predict both COVID-19 hospitalization (2,783 events; C-statistic =0.676 [0.666--0.686]) and COVID-19 mortality (1,564 deaths; C-statistic =0.730 [0.701--0.760]) from parsimonious, regularized feature sets. The IL-1 pathway index (xIL1, +0.093) was independently selected for hospitalization but not mortality, while the IL-6 trans-signaling index (xIL6, + 0.040) was selected for mortality but not hospitalization---a differential pathway weighting corroborated by independent LightGBM/SHAP analysis and mirroring the subsequent success of tocilizumab (anti-IL-6R) and the limited efficacy of anakinra (anti-IL-1R) in reducing COVID-19 mortality in randomized trials conducted years later. The mortality model was additionally characterized by central adiposity (waist-hip ratio, +0.386), a respiratory compromise index (xRSP, +0.149), and prodromal cardiovascular disease (pCVD, +0.246). These findings establish that vulnerability to a novel pathogen is, in substantial part, a pre-existing and measurable prodromal state, with implications for pandemic preparedness and population-level risk stratification.

19

Distinguishing Age-specific Patterns in Comorbidities of Obstructive Sleep Apnea Using Real-World Data

Goodman, M. O.; Alex, R. M.; Sands, S. A.; Azarbarzin, A.; Batool-anwar, S.; Pavlova, M. K.; Epstein, L. J.; Redline, S.; Cade, B. E.

2026-05-28 epidemiology 10.64898/2026.05.20.26352336 medRxiv

Top 4%

0.7%

Show abstract

Obstructive sleep apnea (OSA) is associated with a wide range of comorbidities, but the extent to which these follow predictable, age-dependent patterns is not well understood. Identifying such patterns could provide insight into OSA heterogeneity and its links to physiological measures of OSA. We trained age-dependent topic models (ATM) on longitudinal electronic health records from 36,426 patients with OSA in the Mass General Brigham Biobank. ATM organizes incident diagnoses into distinct comorbidity "topics," whose age-specific disease loadings represent predictive patterns linking related diagnoses across the life course. We applied the trained model to compute individual-level topic scores in independent data: a cohort of 11,689 OSA cases and 22,695 matched controls, and a cohort of 6,220 patients with polysomnography (PSG)-derived physiological measures. We identified 19 distinct age-dependent comorbidity profiles, all significantly associated with OSA case status (FDR-adjusted p<0.05). Topics reflected recognizable clusters including metabolic, neuropsychiatric, and immune-mediated conditions, and several were distinguished by age-of-onset of key comorbidities, such as early- vs late-onset asthma. Seventeen of the 19 topics were significantly associated with at least one of 13 PSG-derived physiological measures, including associations between cardiometabolic topics and the apnea-hypopnea index, sleep apnea specific hypoxic burden, and respiratory event-specific heart rate burden. These findings indicate that age-dependent comorbidity patterns distinguish meaningful OSA subtypes with differing prognoses and endophenotype associations. ATM offers insight into complex OSA comorbidity and suggests that age-informed, topic-based stratification may improve individualized risk assessment, interpretation of PSG findings, and targeting of clinical interventions.

20

Developing and Evaluating Deep Learning Approaches for Visual Field Denoising in Glaucoma

Baek, J. S.; Lokhande, A.; Neuenschwander, D.; Shi, M.; Wang, M.

2026-06-01 ophthalmology 10.64898/2026.05.29.26354019 medRxiv

Top 4%

0.7%

Show abstract

Purpose To investigate the relative efficacy of nine distinct visual field (VF) denoising artificial intelligence (AI) methods and a pathology-aware AI strategy to discourage over-correction of glaucomatous defects. Design Retrospective study. Participants 87,940 paired visual field (VF) and optical coherence tomography (OCT) samples from a tertiary academic center. Methods Denoising models were trained on a separate VF-only dataset and evaluated on an independent structure-function dataset of paired VF-OCT samples. We implemented and evaluated nine distinct VF denoising strategies representing three broad categories: baseline measurements, self-supervised and image restoration models (including Noise2Noise, Noise2Void, and NAFNet), and latent variable compression-based models (autoencoders and variational autoencoders). All models were designed to reconstruct VF sensitivity maps. We then predicted retinal nerve fiber layer thickness (RNFLT) maps from the denoised VFs using a fixed, independently trained VF-to-RNFLT prediction model. Main Outcome Measures Predicted VF and RNFLT maps and resultant evaluation metrics. Results The raw VF baseline achieved a global R2 of 0.5468 and MAE of 16.83 um. Restoration-based models maintained or slightly improved concordance, with the pathology-aware NAFNet achieving the highest global R2 of 0.5485 and a comparable MAE of 16.82 um. In contrast, compression-based models degraded concordance, with CNN-VAE showing a significant reduction (R2 approximately 0.50). In severe glaucoma, concordance decreased across all methods; however, compression architectures exhibited disproportionately greater degradation compared with restoration-based approaches. Conclusions We present a comparative benchmark of AI-based VF denoising strategies paired with structure-function evaluation. While restoration-based models can reduce variability without loss of biological signal, latent compression risks attenuating clinically meaningful defects. Visually smoother fields are not necessarily more biologically accurate.